Automatic bilingual lexicon acquisition using random indexing of parallel corpora
Authors: M. Sahlgren and J. Karlgren
Abstract
This paper presents a very simple and effective approach to using parallel corpora for automatic bilingual lexicon acquisition. The approach, which uses the Random Indexing vector space methodology, finds correlations between terms on the basis of their distributional characteristics. It requires a minimum of preprocessing and linguistic knowledge, and is efficient, fast and scalable. In this paper, we explain how our approach differs from traditional cooccurrence-based word alignment algorithms, and we demonstrate how to extract bilingual lexica using the Random Indexing approach applied to aligned parallel data. The acquired lexica are evaluated by comparing them to manually compiled gold standards, and we report overlap of around 60%. We also discuss methodological problems with evaluating lexical resources of this kind.
1 Lexical Resources Should Be Dynamic
Lexical resources are necessary for any type of natural language processing and language engineering application. Whereas in the early days of language engineering lexical information may have been hard-coded into the system, today most systems and applications rely on explicitly introduced and modularly designed lexica to function. Examples range from applications such as automatic speech recognizers, dialogue systems, information retrieval, and writing aids to computational linguistic techniques such as part-of-speech tagging, automatic thesaurus construction, and word sense disambiguation. Multilingual applications, which are driven by modeling lexical correspondences between different human languages, are obviously reliant on lexical resources to a high degree: the quality of the lexicon is the main bottleneck for quality of performance and coverage of service. Unfortunately, machine readable lexica in general, and machine readable multilingual lexica in particular, are difficult to come by.
Manual approaches to lexicon construction promise high-quality results, but the resulting lexica are time- and labour-consuming to build, costly and complex to maintain, and inherently static: tuning an existing lexicon to a new domain is a complex task that risks compromising existing information and corrupting usefulness for previous application areas. Automatic lexicon acquisition techniques, on the other hand, promise to provide fast, cheap and dynamic alternatives to manual approaches, but typically require sizeable computational resources and have yet to prove their potential in practical application.
This paper introduces a simple and effective approach to using distributional statistics over parallelized bilingual corpora for automatic multilingual lexicon acquisition. The approach is efficient, fast and scalable, and is easily adapted to new domains and to new languages. We evaluate the proposed methodology by first extracting bilingual lexica from aligned Swedish–Spanish and English–German data, and then comparing the acquired lexica to manually compiled gold standards. The results clearly demonstrate the viability of the approach.
2 Cooccurrence-based Bilingual Lexicon Acquisition
Cooccurrence-based bilingual lexicon acquisition models typically assume something along the following lines:
“If we disregard the unassuming little grammatical words, we will, for the vast majority of sentences, find precisely one word representing any one given word in the parallel text. Counterterms do not necessarily constitute the same part of speech or even belong to the same word class; remarkably often, corresponding terms can be identified even where the syntactical structure is not isomorphic.” (Karlgren, 1988)
or, alternatively formulated: “...
words that are translations of each other are more likely to appear in corresponding bitext regions than other pairs of words.” (Melamed, 2000)
These models, first implemented by Brown and colleagues (Brown et al., 1988), use aligned parallel corpora, and posit a translational relation between terms that are observed to occur with similar distributions in corresponding text segments. The calculation of correspondence between segments is an art in itself: translations typically are not done clause by clause, for reasons of informational flow, stylistic finesse, cultural differences, and translator preference; arguably cannot be done term by term, by virtue of the characteristics of human language; and are seldom consistent and correct for any longer stretch of text, owing to the limited stack space of human translators.
Given a set of text segments in two (or more) languages, here arbitrarily called the source and the target languages, that have been aligned satisfactorily, correspondences between term occurrences across the source and target texts can be calculated in several ways. (Naturally, nothing precludes these two sets of text segments from being written in the same language.) Most approaches calculate cooccurrence scores between terms using various related formulas: the number of times a term w_s from the source language occurs in a source text segment T_s(i) is folded together with the number of times a candidate term w_t occurs in the corresponding text segment T_t(i) in the target language. These scores are taken as the primary datum, and similarities between terms are then calculated using these scores as a base.
There are major drawbacks to this approach. Most importantly, as has been noted by several practitioners of the field, the assumption that translation involves correspondences between single lexical items is a patent oversimplification.
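As a rough illustration of the segment-level scoring described above, the following sketch sums, over aligned segment pairs, some combination of the source term count and the target term count. The function name `cooccurrence`, the two combination rules (product and minimum) and the English–German toy segments are assumptions made for illustration only; the sketch does not reproduce any particular published model, which would add further estimation steps on top of such raw scores.

```python
from collections import Counter


def cooccurrence(segments, combine):
    """Sum combine(n_s, n_t) over aligned segment pairs for every term pair.

    segments: list of (source_tokens, target_tokens) pairs, assumed aligned.
    combine:  function folding the two within-segment counts into one score.
    """
    scores = Counter()
    for source_tokens, target_tokens in segments:
        source_counts = Counter(source_tokens)
        target_counts = Counter(target_tokens)
        for w_s, n_s in source_counts.items():
            for w_t, n_t in target_counts.items():
                scores[w_s, w_t] += combine(n_s, n_t)
    return scores


# Hypothetical toy data, not taken from the paper.
segments = [
    ("the house is red".split(), "das haus ist rot".split()),
    ("the red house".split(), "das rote haus".split()),
    ("a red car".split(), "ein rotes auto".split()),
    ("very very good".split(), "sehr sehr gut".split()),
]

# Two simple ways of folding the counts together.
product_scores = cooccurrence(segments, lambda n_s, n_t: n_s * n_t)
minimum_scores = cooccurrence(segments, min)

for (w_s, w_t), score in product_scores.most_common(5):
    print(w_s, w_t, score)
```

Note that every source term is scored against every target term in the aligned segment, so the table of scores grows with the product of the two vocabularies, and frequent function words pick up high scores with many unrelated terms.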
While the operationalization of translation as a correspondence between a lexical item in the source language and an item in the target language certainly is convenient for present purposes, the assumption that meaning is compositional from the morpheme level upwards is contestable. For us, the primary meaning-bearing unit is the utterance: the coherent expression of something meaningful by a speaker or a writer. Our model does not make explicit use of cooccurrence between lexical items in corresponding languages, but rather of individual occurrences of lexical items in contexts of use. Our models model utterances, not lexical items; in the bilingual case, they model translated pairs of utterances. Confounding the model by trying to model collocational data separately from cooccurrence across languages does not improve its theoretical basis.
Modeling occurrence data is basic: lexical items that occur in the same clause within one language are indelibly related through that syntagmatic relation, however the relation is modeled by linguists, and the entire utterance bears a relation to its translation in the target language. In this sense, every term of the source utterance is related to every term in the target utterance, even if their relative import may differ by orders of magnitude.
Unrefined term occurrence data are, however, an unsatisfactory means of modeling semantic relations, because terms are polysemous and have synonyms as a matter of course, and a matrix of term-by-term relations will mostly contain empty cells, modeling nothing but a lack of relevant observations. Thus the high dimensionality of such models will, through its low level of abstraction, obscure semantically salient dependencies; the space of semantic variation appears in some sense to be of a lower inherent dimensionality.
Aside from underlying assumptions of compositionality and distributional semantics, methodologies that take term cooccurrences as primary data will have to address practical issues that have little to do with meaningful semantic correspondence. They will inevitably run into scalability and tractability problems when presented with real data comprising hundreds of thousands of term tokens in millions of texts. These problems will compound all the more rapidly when the data are multilingual (Gale and Church, 1991).
3 Context-based Bilingual Lexicon Acquisition
Our approach, by contrast, takes the context (an utterance, a window of adjacency, or, when necessary, an entire document) as the primary unit. Rather than building a huge vector space of contexts by lexical item types, as most retrieval systems do, implicitly or explicitly, we build a vector space which is large enough to accommodate the occurrence information of tens of thousands of lexical item types in millions of contexts, yet compact enough to be tractable; constant in size in the face of ever-growing data sizes; and designed to model association between distributionally similar lexical items without compilation or explicit dimensionality reduction.
To illustrate the difference between purely cooccurrence-based approaches and our context-based approach, consider the following table.
T_s            Context   T_t
a a a b c c    1         x v y z z z z
a d e          2         v w z
a a a c        3         x x v
Table 1. Example of parallel data.
Brown’s original model (Brown et al., 1988; Brown et al., 1990) and Melamed’s somewhat later model (Melamed, 2000), to take two examples, differ in how they calculate the cooccurrence of w_s and w_t. Brown’s cooccurrence measure is proportional to the product of the two terms’ frequencies in each context; Melamed’s dampens the measure by taking only the smaller of the frequencies.
In this case, Brown’s model would find a and z to have the closest cooccurrence score, with the score for context 1 dominating everything else; Melamed’s model would find a and v to be the closest, with the number of contexts they engage in dominating everything else; our model will, under typical parameter settings, find that a has the closest cooccurrence score with contexts 1 and 3, that x has the same context profile, and that they thus are the most closely corresponding terms.
In other similar experiments, other term distribution measures have been used to temper and modulate the effects of the pure cooccurrence measure. Similarity metrics that combine raw cooccurrence with global occurrence statistics (under the assumption that a term that occurs often in other contexts is a bad candidate), term length in characters (under the assumption that semantically similar terms tend to have similar graphotactic appearance), and term position in the respective text segments (under the assumption that source and target languages have similar syntactic characteristics) have all been tested, often usefully (Karlgren et al., 1994). These different types of information sources, however, often require weeding out weak translation candidates using filters such as other lexical resources, e.g. a lexical categorization of terms by part of speech. Currently, we do not implement such filters, in keeping with our principle of association being a relation between terms and utterances rather than between terms and terms.
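The context-based view can be sketched in a few lines, using the data of Table 1 as read from the flattened original. The sketch below is a minimal illustration, not the authors' implementation: it assumes that each aligned context receives one sparse random index vector, that every occurrence of a term adds that vector to the term's context vector, and that terms are compared by cosine similarity. The dimensionality `DIM`, the number of non-zero entries `NONZERO`, the choice to count every token occurrence, and all function names are assumptions. Which target term ends up closest to a given source term depends on such parameter choices, so no ranking is asserted here.

```python
import math
import random
from collections import defaultdict

DIM = 512      # size of the vector space (assumed value)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed value)


def index_vector(rng):
    """Sparse random index vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * DIM
    for k, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec


def build_context_vectors(segments, rng):
    """Accumulate one context vector per term, separately for each language."""
    src = defaultdict(lambda: [0] * DIM)
    tgt = defaultdict(lambda: [0] * DIM)
    for source_tokens, target_tokens in segments:
        iv = index_vector(rng)  # one index vector per aligned context
        for vectors, tokens in ((src, source_tokens), (tgt, target_tokens)):
            for token in tokens:
                vec = vectors[token]
                for i, x in enumerate(iv):
                    if x:
                        vec[i] += x
    return src, tgt


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translation_candidates(term, src, tgt):
    """Target terms sorted by similarity of their context vectors to term's."""
    return sorted(((cosine(src[term], vec), t) for t, vec in tgt.items()),
                  reverse=True)


# Table 1, as reconstructed: (source segment, target segment) per context.
TABLE_1 = [
    ("a a a b c c".split(), "x v y z z z z".split()),
    ("a d e".split(), "v w z".split()),
    ("a a a c".split(), "x x v".split()),
]

src, tgt = build_context_vectors(TABLE_1, random.Random(0))
for score, term in translation_candidates("a", src, tgt):
    print(f"{term}: {score:.3f}")
```

The size of each term vector stays at `DIM` however many contexts are added, which is the property the text refers to as a space that is constant in size in the face of growing data.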
Similar resources
Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data
This paper presents a very simple and effective approach to automatic bilingual lexicon acquisition. The approach is cooccurrence-based, and uses the Random Indexing vector space methodology applied to aligned bilingual data. The approach is simple, efficient and scalable, and generates promising results when compared to a manually compiled lexicon. The paper also discusses some of the methodolo...
Dynamic Lexica for Query Translation
This experiment tests a simple, scalable, and effective approach to building a domain-specific translation lexicon using distributional statistics over parallelized bilingual corpora. A bilingual lexicon is extracted from aligned Swedish-French data and used to translate CLEF topics from Swedish to French; the resulting French queries are then in turn used to retrieve documents from the French ...
Data-driven Amharic-English Bilingual Lexicon Acquisition
This paper describes a simple statistical language modelling approach to bilingual lexicon acquisition from Amharic-English parallel corpora. The goal is to induce a seed translation lexicon from sentence-aligned corpora. The seed translation lexicon contains matches of Amharic lexemes to weakly inflected English words. Purely statistical measures of term distribution are used as the basis ...
Cross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet
This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a sm...
Automatic Acquisition of a High-Precision Translation Lexicon from Parallel Chinese-English Corpora
This paper presents a hybrid approach to deriving a translation lexicon from unaligned parallel Chinese-English corpora. Two types of information, namely, proximity and document-external distributions of word pairs, are proposed to enhance the precision of the translation lexicon derived from statistical and dictionary-based methods. The former can identify translations of Chinese compounds, wh...
Semi-automatic Compilation of Bilingual Lexicon Entries from Cross-Lingually Relevant News Articles on WWW News Sites
For the purpose of overcoming the resource scarcity bottleneck in corpus-based translation knowledge acquisition research, this paper takes the approach of semi-automatically acquiring domain-specific translation knowledge from a collection of bilingual news articles on WWW news sites. This paper presents results of applying standard co-occurrence frequency based techniques of estimating bilingual...
Journal: Natural Language Engineering
Volume: 11
Publication date: 2005